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■^^ , In this paper we introduce a general theory for nonlinear sufficient 

fvj ' dimension reduction, and explore its ramifications and scope. This 

theory subsumes recent work employing reproducing kernel Hilbert 
spaces, and reveals many parallels between linear and nonlinear suffi- 
pH ■ cient dimension reduction. Using these parallels we analyze the prop- 

^0 , erties of existing methods and develop new ones. We begin by char- 

acterizing dimension reduction at the general level of <r-fields and 
proceed to that of classes of functions, leading to the notions of suf- 
ficient, complete and central dimension reduction classes. We show 
that, when it exists, the complete and sufficient class coincides with 
the central class, and can be unbiasedly and exhaustively estimated 
by a generalized sliced inverse regression estimator (GSIR). When 
completeness does not hold, this estimator captures only part of the 
central class. However, in these cases we show that a generalized 
sliced average variance estimator (GSAVE) can capture a larger por- 
JVs ' tion of the class. Both estimators require no numerical optimization 

f"**) i because they can be computed by spectral decomposition of linear 

operators. Finally, we compare our estimators with existing methods 
by simulation and on actual data sets. 
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1. Introduction. In this paper we propose a general theory for nonlinear 
sufficient dimension reduction (SDR), develop novel estimators and investi- 
gate their properties under this theory. Along with these developments we 
also introduce a new conditional variance operator, which can potentially 
be used to generalize all second-order dimension reduction methods to the 
nonlinear case. 
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In its classical form, linear SDR seeks a low-dimensional linear predictor 
that captures in full a regression relationship. Imagining a regression setting 
that comprises multiple predictor variables and multiple responses, let X and 
Y be random vectors of dimension p and q. If there is a matrix /3 E R p 
with d < p such that 

(i) yii|/? T i, 

then the subspace spanned by the columns of j3 is called a sufficient dimen- 
sion reduction (SDR) subspace. Under mild conditions, the intersection of all 
such subspaces still satisfies (1), and is called the central subspace, denoted 
by S Y \ X ; see Li (1991, 1992), Li and Duan (1989), Duan and Li (1991), 
Cook and Weisberg (1991), Cook (1994, 1998b). A general condition for the 
existence of the central subspace is given by Yin, Li and Cook (2008). 

Several recent papers have combined sufficient dimension reduction and 
kernels; see Akaho (2001), Bach and Jordan (2002), Fukumizu, Bach and 
Gretton (2007), Wu (2008), Wu, Liang and Mukherjee (2008), Hsing and 
Ren (2009), Yeh, Huang and Lee (2009), Zhu and Li (2011) and Li, Artemiou 
and Li (2011). This proliferation of work, in addition to producing versatile 
methods for extracting nonlinear sufficient predictors, points toward a gen- 
eral synthesis between the notions of sufficiency at the core of SDR and the 
ability to encompass nonlinearity afforded by kernel mappings. To achieve 
this synthesis, explore its many ramifications and broad scope and develop 
new estimators based on it, are the goals of this paper. 

Specifically, we articulate a general formulation that comprises both lin- 
ear and nonlinear SDR, and parallels the basic theoretical developments 
pioneered by Li (1991, 1992) and Cook (1994, 1998a, 1998b). This for- 
mulation allows us to study linear and nonlinear SDR comparatively and, 
somewhat surprisingly, to relax some stringent conditions required by linear 
SDR. For example, a linear conditional mean [Li (1991), Cook (1998b)] is no 
longer needed for unbiasedness, and the sufficient conditions for existence 
and uniqueness of the central subspace are far more general and transparent. 
Finally, our general formulation links linear and nonlinear SDR to the clas- 
sical notions sufficiency, completeness and minimal sufficiency, which brings 
insights and great clarity to the SDR theory. 

Our developments and the sections of this paper, can be summarized 
as follows. In Section 2, we build upon the ideas of Cook (2007) and Li, 
Artemiou and Li (2011) to define an SDR cr-field as a sub cr-field Q of o~(X) 
(the cr-field generated by X) such that Y _LL X\Q, and the corresponding 
SDR class as the set of all square-integrable, (/-measurable functions. Under 
very mild conditions — much milder than the corresponding conditions for 
linear SDR [Yin, Li and Cook (2008)] — we show that there exists a unique 
minimal cr-field Gy\x that satisfies Y _U_ X\Gy\Xi which we call the central 
a -field. The set of all £y|x-measurable, square-integrable functions is named 
the central class. 
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In Section 3, we provide two additional definitions that generalize concepts 
in Cook (1998b), Li, Zha and Chiaromonte (2005) and Li, Artemiou and Li 
(2011): a class of functions is unbiased if its members are ^yi^-measurable, 
and exhaustive if they generate Qy\x- Next, we show that the special class 

(2) L 2 (P X )Q[L 2 (P X )QL 2 (P Y )) 

is unbiased, where L 2 (Px) and L 2 {Py) are the spaces of square-integrable 
functions of X and Y. For reasons detailed in Section 3, we call this class 
the regression class. 

In Section 4, we introduce the complete dimension reduction class: if Q C 
a(X) is a cr-field and for each ^-measurable / € L 2 (Px) we have 

E(f(X)\Y) = almost surely =>■ f(X) = almost surely, 

then we say that the class of ^-measurable functions in L 2 (Px) is complete. 
We prove that when a complete sufficient dimension reduction (CSDR) class 
exists it is unique and coincides with the central class. We further show that 
the CSDR class coincides with the regression class — which is therefore not 
just unbiased, but also exhaustive. 

In Section 5 we establish a critical relationship between the regression 
class and a covariance operator linking X and Y and, based on this, we 
generalize sliced inverse regression [SIR; Li (1991)] to a method (GSIR) 
that can recover the regression class — and hence is unbiased and exhaustive 
under completeness. In Section 6, we consider the case where the central 
class is not complete, so that GSIR is unbiased but no longer exhaustive. 
By introducing a novel conditional variance operator, we generalize sliced 
average variance estimation [SAVE, Cook and Weisberg (1991)] to a method 
(GSAVE) that can recover a class larger than the regression class. Here, the 
situation is similar to that in the linear SDR setting, where it is well known 
that 

(3) SIR subspace C SAVE subspace C Sy\x', 

see Cook and Critchley (2000), Ye and Weiss (2003), Li, Zha and Chiaromonte 
(2005) and Li and Wang (2007). 

In Section 7 we develop algorithms for the sample versions of GSIR and 
GSAVE, and a cross-validation algorithm to determine regularizing parame- 
ters. In Section 8 we compare GSIR and GSAVE with some existing methods 
by simulation and on actual data sets. Section 9 contains some concluding 
remarks. Some highly technical developments are provided in the supple- 
mentary material [Lee, Li and Chiaromonte (2013)]. 

2. Sufficient dimension reduction er-fields and classes. Let (Q, J 7 , P) be 

a probability space and (£lx,Fx), (fiy,^V) and (flxy, Fxy) be measurable 
spaces. For convenience, assume that Qxy = &x x &Y and Txy = Fx x J^Y- 
Let X, Y and (X,Y) be random elements that take values in Clx, ^y and 
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fixy, with distributions Px, Py, Pxy, which are dominated by <7-finite 
measures. Let 

a(X) = X-\F x ), o{Y)=Y-\F Y ), a(X,Y) = (X^Y^^xy), 

and finally let -Pyix('l') '-^~Y x ^x — >■ 7£ be the conditional distribution of Y 
given X . 

Definition 1. A sub cr-field Q of a(X) is an SDR cr-field for Y versus 
X if it satisfies 

(4) YMX\G, 
that is, if Y and X are independent given Q. 

This definition is sufficiently general to accommodate the two cases of 
nonlinear sufficient dimension reduction that interest us the most. The first 
case is when Qx = ^- p and Q,y = M3 for some positive integers p and q, 
and Txi Fy and Txy are Borel er-fields generated by the open sets in 
MP, M. q and R p+<? . Clearly, in this case, the conditional independence (4) is a 
generalization of (1) for linear SDR: if we take Q = a((3 T X), then (4) reduces 
to (1). 

The second case is when X or F, or both of them, are random functions. In 
this case Definition 1 is a generalization of the linear SDR for functional data 
introduced by Ferre and Yao (2003), and Hsing and Ren (2009). Specifically, 
let [a, b] be a closed interval, A the Lebesgue measure and i^(A) the class 
of functions on [a, b] that are square integrable with respect to A. Let Qx = 
£2 (A) and Sly = M. In this case, each X(ui) is a function in L2(A), which, 
depending on applications, could be, say, a growth curve or the fluctuation 
of a stock price. Let hi,...,h<i be functions in Z/2(A). Ferre and Yao (2003) 
considered the following functional dimension reduction problem: 

(5) Y AL X\(X, /n) i2(A) , . . . , (X, h d ) L2{x) . 

This generalizes linear SDR to the infinite-dimensional case, but not to the 
nonlinear case, because (X, /ii)i 2 (A), ••-, {X , hd) l 2 (\) are linear in X. Hsing 
and Ren (2009) considered a more general setting where the sample paths 
{Xf(u}) :t G J} need not lie within L2(A). Still, their generalization is inher- 
ently linear in the same sense that problem (5) is linear. In contrast, our 
formulation in (4) allows an arbitrary sub <7-field of cr(X), which need not be 
generated by linear functionals. Interestingly, as we will see Section 5, it is 
the relaxation of linearity that allows us to remove a restrictive linear condi- 
tional mean assumption used both in Ferre and Yao (2003) (Theorem 2.1), 
and in Hsing and Ren (2009), assumption (IR2). 

The notion of sufficiency underlying SDR, as defined by (1) and (4), is dif- 
ferent from the classical notion of sufficiency because Q is allowed to depend 
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on any parameter in the joint distribution of Pxy ■ For example, Q = a(/3 X) 
depends on the parameter (3 [or rather, the meta-parameter span(/3)] which 
characterizes the conditional distribution of Y\X. Nevertheless, both no- 
tions imply a reduction, or simplification, in the representation of a stochas- 
tic mechanism — the SDR one through a newly constructed predictor, and 
the classical one through a statistic. Indeed, it is partly by exploring and 
exploiting this similarity that we developed our theory of nonlinear SDR. 

Obviously there are many sub u-fields of X that satisfy (4), starting with 
o~(X) itself — which induces no reduction. For maximal dimension reduction 
we seek the smallest such a-field. As in the case of classical sufficiency, the 
minimal SDR c-field does not universally exist, but exists under very mild 
assumptions. The next theorem gives the sufficient condition for the minimal 
SDR (T-field to uniquely exist. The proof echoes Bahadur (1954), which 
established the existence of the minimal sufficient cr-field in the classical 
setting. 

Theorem 1. Suppose that the family of probability measures {Px\y('\u) '■ 
y G Fly} is dominated by a a -finite measure. Then there is a unique sub a- 
field Q* of cr(X) such that: 

(1) YALX\G*; 

(2) if Q is a sub a-field of o~(X) such that Y _IL X\Q , then Q* C Q. 

Proof. Let IT, = Px\y(-\u) an d P= {^-y'-V £ &y}- Since P is dominated 
by a cr-finite measure, it contains a countable subset Q = {Qk '■ k = 1, 2, . . .} 
such that Q = P, where = means two families of measures dominating each 
other. Let {ck : k = 1, 2, . . .} be a sequence of positive numbers that sum to 
1, and let Qq = YlT=i c kQk- Then Qq is a probability measure on &x such 
that {Qo} = Q = P. Let n y = dU y /dQ and Q be a sub a-field of a{X). We 
claim that the following statements are equivalent: 

(1) YALX\G; 

(2) 7T y is essentially measurable with respect to Q for all y G Oy mod- 
ulo Q - 

Proof of 1 => 2. Let B^T X - Then 

E Qo (n y (X)I B (X))=E Yly (I B (X)) = Eu y [En y (lB(X)\g)] 
= E Qo [E Uy (I B (X)\g)7r y (X)]. 

By l,U y (B\g) is the same for all y G fty. Hence U y (B\G) = Qk(B\G) for all 
k, which implies H y (B\Q) = Qq(B\Q). Hence we can rewrite the right-hand 
side of the above equalities as 

E Qo [E QQ (I B (X)\g)TT y (X)]=E Qo [I B (X)E Qo (n y (X)\g)]. 
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Thus the following equality holds for all B S J~x '■ 

E Qo (n y (X)I B (X)) = E QQ [I B (X)E Qo (7r y (X)\g)] 1 

which implies n y (X) = EQ (7r y (X)\Q) a.s. Q Q . 
PROOF of 2 => 1. For any AeQ, 

En y [E Qo (I B (X)\g)I A (X)}=E Qo [E Qo (I B (X)\g)I A (X)n y (X)} 

= E Qo [I B (X)I A (X)E Qo (7r y (X)\G)}. 
By 2, EQ (n y (X)\Q) = n y (X). Hence the right-hand side becomes 

E Qo [I B (X)I A (X)7r y (X)} = E Uy [I B (X)I A (X)}=U y (X eAHB). 

Thus Eq (I b (X)\Q) = Qq{B\Q) is the conditional probability ILy(B\Q), which 
means H y (B\Q) does not depend on y. That is, 1 holds. 

Now let Q* be the intersection of all SDR u-fields Q . Then Q* is itself a 
(j-field. Moreover, since n y is essentially measurable with respect to all SDR 
cr-fields for all y £ Jly, it is also essentially measurable with respect to Q* 
for all y € Sly. Consequently, Q* is itself an SDR a- field, which implies that 
it is also the smallest SDR <7-field. If Q** is another smallest SDR <7-field, 
then we know Q* C Q** and Q** C Q* . Thus Q* is unique. D 

We can now naturally introduce the following definition: 

Definition 2. Suppose that the class of probability measures {Px\y{'\v) '■ 
y G fiy} on £lx is dominated by a u-finite measure. Then we call the a-field 
Q* in Theorem 1 the central cr-field for Y versus X, and denote it by Qy\x- 

Notably, this set up characterizes dimension reduction solely in terms 
of conditional independence. However, explicitly turning to functions and 
introducing an additional mild assumption of square integrability are very 
consequential for further development because they allow us to work with 
structures such as orthogonality and projection. 

Let L,2(Pxy), L>2(Px) and L2(-Py) be the spaces of functions defined on 
fijy, Qx and Sly that are square- integrable with respect to Pxy, Px and 
Py, respectively Since constants are irrelevant for dimension reduction, we 
assume throughout that all functions in I^OPxOj -^(-FV) and L2{Pxy) have 
mean 0. Given a sub <7-field Q of a(X,Y), we use Tig to denote the class of 
all functions / in L,2(Pxy) such that f(X) is ^-measurable. If Q is generated 
by a random vector, say X, then we use 9JIx to abbreviate 9Jl CT m- It can 
be easily shown that, for any G, WHg is a linear subspace of L2(Pxy)- 

Definition 3. Let Q be an SDR u-field and Gy\x be the central cr-field. 
Then Tig is called an SDR class, and Tlg Y . x is called the central class. The 
latter class is denoted by &y\x- 
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The central class, comprising square-integrable functions that are measur- 
able with respect to the central cx-field Qy\x-> represents our generalization 
of the central space Sy\x defined in linear SDR; see the Introduction. 

3. Unbiasedness and exhaustiveness. In linear SDR, the goal is to find 
a set of vectors that span Sy\x- If a matrix 7 satisfies span(7) C Sy\xi we 
say that 7 is unbiased [Cook (1998b)]. If span(7) = S Y \x-, we sa Y that 7 is 
exhaustive [Li, Zha and Chiaromonte (2005)]. Note that when span(7) C 
Sy\Xi 7 X is a linear function of /3 X, where (3 is any matrix such that 
span(/3) =Sy\x'i if span(7) =Sy\Xi then j T X is an injective linear trans- 
formation of /3 T X. In the nonlinear setting, we follow the same logic but 
remove the linear requirement. Part of the following definition was given in 
Li, Artemiou and Li (2011). 

Definition 4. A class of functions in L 2 (Px) is unbiased for & Y \x if 
its members are ^yix-measurable, and exhaustive for Gy\x if its members 
generate Qy\x- 

Next, we look into what type of functions are unbiased. The lemma below 
provides a characterization of the orthogonal complement of 9Jlg that will 
be used many times in the subsequent development. Its proof is essentially 
the definition of the conditional expectation, and is omitted. 

Lemma 1. Suppose U is a random element defined on (Q,J-), Q is a sub 
a -field of cr(U) and f S L 2 (Pjj). Then f is orthogonal to Tig (f _!_ Tig) if 
andonlyifE[f(U)\g]=0. 

Note that _U_ and _L have different meanings: the former means indepen- 
dence; the latter means orthogonality. For two subspaces, say S± and £2, of 
a generic Hilbert space 7i, we use S\ Q 1S2 to denote the subspace Si n S-f ■ 
The following theorem explicitly specifies a class of functions, which we call 
regression class, that is unbiased for &y\x- 

Theorem 2. If the family {H y : y € Cly } is dominated by a a -finite mea- 
sure, then 

(6) L 2 (P X ) e [L 2 (P X ) e L 2 (P Y )] C & Y \x- 

Proof. It is equivalent to show that L 2 {Px)Q < 3y\x Q L 2 (P x ) L 2 (P Y ) . 
If feL 2 (P x ) e&y\x, then, by Lemma 1, E[f(X)\g Y \ x ]=0. Since Q Y \ X is 
a sufficient cr-field, 

E[f(X)\Y} = E[E(f(X)\Y,g Yl x)\Y] = E[E(f(X)\g Yl x)\Y] = 0. 

By Lemma 1 again, / _L 9Jl Y . Because Tly = L 2 (P Y ), we have / G L 2 {Px) 

l 2 (p y ). a 
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The intuition behind the term "regression class" is that Li2{Px) QL^iPy) 
resembles the residual in a regression problem; thus L2(Px) O [Li{Px) © 
Li(Py)\ is simply the orthogonal complement of the "residual class." Hence- 
forth we write the regression class as £y\x- 

4. Complete and sufficient dimension reduction classes. After showing 
that the regression class (2) is unbiased, we investigate under what condi- 
tions it is also exhaustive for the central class &y\x- To this end we need to 
introduce the notion of complete classes of functions in ^(-Px')- 

Definition 5. Let Q C o~(X) be a sub cr-field. The class Tig is said to 
be complete if, for any g G Tig , 

E[g(X)\Y}=0 a.s. P => g{X) = a.s. P. 

Again there are similarities and differences between completeness as de- 
fined here and in the classical setting. A complete and sufficient statistic 
in the classical setting is a rather restrictive concept, often associated with 
exponential families, the uniform distribution, or the order statistics. In 
contrast, completeness here is a rather general concept. To demonstrate this 
point, in the next two propositions we give two examples of complete and 
sufficient dimension reduction classes. In particular, the first shows that if 
Y is related to X through any regression model with additive error, then 
the subspace of L,2(Px) determined by the regression function is a complete 
and sufficient dimension reduction class. In the following, [L2(Px)] 9 denotes 
the (/-fold Cartesian product of L2{Px)- 

Proposition 1. Suppose there exists a function h £ [L2(Px)] q such that 

(7) Y = h(X)+e, 

where e _U_ X and E(e) = 0. Then Tl^x) i- s a complete and sufficient dimen- 
sion reduction class for Y versus X . 

Note that, since L2{Px) is centered, we have implicitly assumed that 
E\h{X)\ = [and hence E(Y) = 0]. However, this does not entail any real 
loss of generality because the proof below can be easily modified for the case 
where L2(Px) is not centered. 

Proof of Proposition 1. Suppose m e Tl h ( X ) and E[m(X)\Y] = 
a.s. P. Then there is a measurable function g :M q — > K such that m = g o 
h. Let U = h(X). Then E(g(U)\Y) = a.s. P. By Lemma 1, for any / G 
L 2 (P Y ), we have E[g(U)f(Y)] = 0. In particular, E[g(U)e itTY ] = 0. Because 
U _U_ e, this implies 

E[g(U)e itTu ]E(e itT£ ) = E[g(fJ)e i{X u e i{ * £ \ = E[g(U)e itTY ] = 0. 
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Hence E[g(U)e lt ] = 0. By the uniqueness of inverse Fourier transformation 
we see that g(U) = a.s. P, which implies m(X) = (goh)(X) = a.s. P. □ 

The expression in (7) covers many useful models in statistics and econo- 
metrics. For example, any homoscedastic parametric or nonparametric re- 
gression, such as the single index and the multiple index models [Ichimura 
and Lee (1991), Hardle, Hall and Ichimura (1993), Yin, Li and Cook (2008)], 
are special cases of (7). Thus, complete and sufficient dimension reduction 
classes exist for all those settings. The next proposition considers a type of 
inverse regression model, in which X is transformed into two components, 
one of which is related to Y by an inverse linear regression model, and the 
other independent of the rest of the data. 

Proposition 2. Suppose q<p, fiy has a nonempty interior, and Py 
is dominated by the Lebesgue measure on ~R q . Suppose there exist functions 
9 G [L 2 (P x )} q and h G [L 2 {Px)Y~ q such that: 

(1) g{X)=Y + e, where Y Me, and e ~ JV(0,E); 

(2) a(g(X),h(X))=a(X); 

(3) h(X)AL(Y,g(X)); 

(4) the induced measure Px ° g 1 is dominated by the Lebesgue measure 

onR q . 

Then $Jt g (x) ^ s a complete sufficient dimension reduction class for Y ver- 
sus X . 

Proof. Assumption 3 implies Y _U_ h(X)\g(X), which, by assumption 2, 
implies Y JL X\g(X). That is, 9R g (x) ^ s an SDR class. Let u 6 9Jl g (x)- 
Then u = v o g for some measurable function v:M q — > R. Let U = g(X). 
Suppose that i£['u([/')|Y'] = almost surely P. Because Y JL e, this implies 
Py ({y : Ev{y + e) = 0}) = 1. In other words, 

R9 W (2tt)9/2|S|1/2 
a.s. Py. This implies 

J ^e-^" 1 '/ V TE ~ 14 dt = => [v(Zs)e- sJ * s / 2 ey Js ds = 

a.s. Py, where s = S~ 1 t. Because Qy contains an open set in M. q and the 
above function of y is analytic, by the analytic continuation theorem, the 
above function is everywhere on IR 9 . Hence, by the uniqueness of inverse 
Laplace transformation, we have 

i>(£s)e~ sTEs/2 = almost surely A, 

where A is the Lebesgue measure on M. q . But, because e~ s Ss ' 2 > 0, we have 
v(Y±s) = a.s. A or equivalently v(t) = a.s. A. By the change of variable 
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theorem, 

J dP X = f dP X og-\ 

Jvog(x)^0 Jv(t)^Q 

By assumption 4, Px ° g" 1 <C A. Hence the above integral is 0, implying 
v o g(x) = a.s. Px, or, equivalently, v o g(X) = a.s. P. D 

Inverse regressions of this type are considered in Cook (2007), Cook and 
Forzani (2009), and Cook, Li and Chiaromonte (2010) for linear SDR. The 
above two propositions show that a complete and sufficient dimension reduc- 
tion class exists for a reasonably wide range of problems, including forward 
and inverse regressions of very general, nonparameterized form. The next 
theorem shows that when a complete and sufficient dimension reduction 
class exists, it is unique and coincides with the central class. Once again, 
the situation here echoes that in classical theory, where a complete and suf- 
ficient statistic, if it exists, coincides with the minimal sufficient statistic; 
see Lehmann (1981). 

Theorem 3. Suppose {Ii y : y € fiy} is dominated by a a -finite measure, 
and Q is a sub a -field of a{X). If Tig is a complete and sufficient dimension 
reduction class, then 

<mg = £ Y \x = &Y\x- 

Proof. If / _L <L Y \x, then by Lemma 1, E(f\Y) = which, because Tig 
is sufficient, implies 

E[E(f\g)\Y]=0. 

Because Tig is complete and because E(f\Q) € Tig, we have E(f\Q) = 0. By 
Lemma 1, this implies / _L Tig. Thus we have proved Tig C <£y\x- However, 
by Theorem 2 we know that £y\x Q ©y|x ^ Tig. This proves the desired 
equality. □ 

5. Generalizations of SIR and their population-level properties. From 
the previous developments we see that the subspace L2(Px) G -^(-Py) of 
J^2{Px) plays a critical role in nonlinear SDR. Its orthogonal complement 
in LziPx) coincides with the central class &y\x under completeness, and 
even without completeness it is guaranteed to be inside &y\x- It turns out 
that this subspace can be expressed as the range of a certain bounded lin- 
ear operators. This representation ensures that estimation procedures can 
rely on simple spectral decompositions, rather than complicated numerical 
optimizations. We first introduce some covariance operators which are the 
building block of this approach. 

5.1. Covariance operators. Since constants are irrelevant here (e.g., / 
and / + 3 can be considered as the same function), we will speak of set 
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relations modulo constants. If A and B are sets, then we say j4CB modulo 
constants if for each / G A there is c G R such that / + c G -B. We say that A 
is a dense subset of B modulo constants if (i) AC B modulo constants and 
(ii) for each / G B, there is a sequence {f n } ^ A and a sequence of constants 
{c n } C R such that {/ n + c n } C A and f n + c n — >• / in the topology for B. 
Let %x and Hy be Hilbert spaces of functions of X and Y satisfying the 
conditions: 

(A) Tix and Tty ar e dense subsets of L2(Px) an d L2(Py) modulo con- 
stants; 

(B) there are constants C\ > and C2 > such that var[/(X)] < C\ ||/||-h x 
and vav[g(Y)] <C 2 \\g\\n Y - 

Although we will later take Tix and Hy to be reproducing kernel Hilbert 
spaces (RKHS), our theory is not restricted to such spaces. In particular, we 
do not require the evaluation functionals [such as / 1— > f(x) from Tix to R] 
to be continuous. 

For two generic Hilbert spaces H\ and %2> let B{%i,'H2) denote the 
class of all bounded linear operators from %i to H21 and let B(H\) ab- 
breviate B(T-Li,'Hi). We denote the range of a linear operator A by ran A, 
the kernel of A by kerA, and the closure of ran A by ran A. Under as- 
sumption (B), the symmetric bilinear form u:Hx x T~ix — > R defined by 
u(f,g) = cov[f(X),g(X)] is bounded and thus induces an operator Mxx G 
B(7-Lx) that satisfies {f,Mxxg)u x =u ifi9)- Similarly, the bounded bilin- 
ear form (/,<?) i—)- cov[f (Y) , g(Y)] from T-Ly x T~iy to R defines an opera- 
tor Myy G B{T~Ly). Let Qx and (?y represent the subspaces ran Mxx and 
ran My y. 

Definition 6. Suppose conditions (A) and (B) are satisfied. We define 
the covariance operators Sxx : Qx — > Qx-, ^yy -Gy -^ Qy and Syx : Qx ~^ 
Qy through the relations 

(f,^xxg}g x = (/>5)l 2 (p x )> (f^YYg)g Y = {f-,9)L 2 {p Y )-> 
{f,^yxg}g Y = (f,g)L 2 (p Y )- 

These operators are essentially the same as those introduced by Fukumizu, 
Bach and Jordan (2004, 2009), except that here we do not assume Hx 
and Tiy to be RKHS. By Baker [(1973), Theorem 1], there is a unique 

operator Ryx G B(Gx,Qy) such that Syx = ^yyRyx^xx- We- call Rxy 
the correlation operator. In order to connect these operators with the central 

1/2 
class, which is an L2(-Px)-object, we need to extend the domains of ^xx 

1/2 
and Syy from Qx and Qy to L2\Px) and L2{Pyj- The following extension 

theorem is important and nontrivial, but since the material presented here 

can be understood without its proof we relegate it to the supplementary 

material [Lee, Li and Chiaromonte (2013)]. 
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Theorem 4. Under assumptions (A) and (B), there exist unique iso- 
morphisms 

t^l : L 2 (P x )^Gx, t 1 ^ : L 2 (Py) ->• Qy 

that agree with ^xx an< ^ ^yy on Gx and Qy in the sense that, for all 
f EQx and geGy, 

tf x (f - Ef) = T% 2 x f, t\i Y {g - Eg) = T&g. 

Furthermore, for any f £ L 2 (P X ), g € L 2 (Py) we have 

(8) (t l J Y g,R YX t][ x f) gY =coy[g(Y)J(X)]. 

The easiest way to understand equality (8) is through the special case 
where f = f — E(f'), g = g' — E(g') where /' G Gx, g' &Qy- In this case, 

{T, Y / Y g,RYx^xxf)g Y = (^YYd', RYX^xx h ')g Y = (9^Yxf')g Y 

= cov[f(X),g(Y)]. 

The theorem also implies that, for all f,g& L 2 (P X ) an d s,iS L 2 (Py), 

(t¥x9^¥xf)g x = {g,f)L 2 (P x ) = cov[g(X)J(X)], 

{t 1 J Y s,t 1 J Y t)g Y = (s,t) L2{PY) =cov[s(Y),t(Y)}. 

5.2. Generalized SIR. The results of the last subsection allow us to char- 
acterize L2(Px) QL2(Py) in terms of extended covariance operators, which 
is the key to developing its estimator. Recall that classical SIR [Li (1991)] 
for linear SDR is based on the matrix 

(9) [var(X)] _1 var[S(X|y)]. 

Under the linear conditional mean assumption requiring that E(X\f3 T X) 
be linear in X for any matrix /3 spanning Sy\ x , the re-scaled "inverse" 
conditional mean [var(X)] -1 .E(X|y) is contained in this space. To generalize 
this to the nonlinear setting, we first introduce a conditional mean operator. 

Definition 7. We call the operator f^ YY ''Ry X Tt xx '■ ^(Px) ->• L 2 (Py) 
the conditional expectation operator, and denote it by E x \y. 

The relation between the conditional expectation operator and condi- 
tional expectations is elucidated by the next proposition, which is followed 
by an important corollary. 

Proposition 3. Under conditions (A) and (B), we have: 

(1) for any f € L 2 (P X ), E x \yf = E(f(X)\Y); 

(2) for any g € L 2 {Py), E*g = E(g(Y)\X). 



NONLINEAR SUFFICIENT DIMENSION REDUCTION 13 

PROOF. For any g G L 2 (Py), 

{ E X\Yf,g)L 2 (P Y ) = &YY R YX^xxf^9) L 2(P Y ) = ( R YX^Xxf ^Yy9)-Hy 

= cov(f(X),g(Y)), 

where the last equality follows from (8). Hence cov(/(X) — (E x i Y f)(Y), 
g(Y)) = 0. By the definition of conditional expectation, E x \ Y f = -^(/POI^Oi 

which proves 1. Assertion 2 follows from the fact that ^ YY anc ^ ^xx are 
isomorphisms, and R YX = R XY- D 

Corollary 1. Under conditions (A) and (B), /or any f,gEL 2 (Px); 
(10) ( 5 ,^ |y i? X | y /} L2(Px) =cov[ J E(o(X)|y), J B(/(X)|y)]. 

Moreover, E x , y E x \y S B(L2(Px)), o-nd ^s norm is no greater than 1. 

Proof. We have 

{9,E X \Y E X\Yf) L 2(Px) = ( E X\Y9,Ex\Yf)L 2 (P x ) 

= (E(g(X)\Y),E(f(X)\Y)) L2{Px) , 

~ 1/2 

which is the right-hand side of (10). Moreover, since ^ xx is isomorphic, we 
have 

E X\Y E X\Y = (^YY R YX y ^ XX )*( y ^YY R YX Y ^ XX ) = ^> XX RxyRyX^ XX - 

~ 1/2 ~ 1/2 ~ 1/2 

Hence ||£ , ^iy£ ; x|y II < ll^xx II ll-^-xy HII-Ryx llll^xxll- Because Xj^ and 
Syj£ are isomorphisms, their norms are both 1. By Baker [(1973), The- 
orem 1], H-Ryxll < 1- Hence ||£ , ^,yi?x|y|| < 1- ^ 

From this corollary we see that the quadratic form 

f^(f,E x]Y E xlY f) L2{Px) , L 2 (P X ) x L 2 (P X )^R 

generalizes the matrix var[i?(X|Y)] of the linear case, which is the essential 
ingredient of SIR for linear SDR. It is then not surprising that the operator 
E x , y E x iy is closely connected to the central class for nonlinear SDR, as 
shown in the following theorem. 

Theorem 5. If conditions (A) and (B) are satisfied and &y\x * s com- 
plete, then 

Yan(E* xlY E x \ Y ) = &y\x- 

Proof. By Lemma 1, / G <t Y \x if and only if / G L 2 {Px) and E(f\Y) = 
0. By Proposition 3, this happens if and only if / G keiE x i Y . This shows 
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ker E x \y = &y\x- However, because ker(Ex\y) = ker(E x , Y E x \Y), we have 
TETi(E xlY E xlY ) = [keriE^yE^y)] 1 = (kevE^y) 1 - = (^ |x ) ± = (L Y \x- 
Since &y\ x is complete, we have £y\ x = &Y\X, as desired. D 

Note that, unlike in classical SIR for linear SDR, here we do not have to 
consider an analogue to the rescaling [var(X)] -1 in (9). This is because the 
L*2 (Px) -inner product absorbs the marginal variance in the predictor vector. 
We refer to the sample estimator based on um(E x , Y E x iy) (see Section 7.2) 
as generalized SIR or GSIR. The GSIR estimator is related to kernel canon- 
ical component analysis (KCCA) introduced by Bach and Jordan (2002); 
see also Fukumizu, Bach and Gretton (2007). In Section 7.2 we will explore 
similarities and differences between these two methods. 

5.3. Kernel SIR. We now turn to another nonlinear SDR estimator, 
which was proposed by Wu (2008) and further studied by Yeh, Huang and 
Lee (2009), called kernel sliced inverse regression (KSIR). In our setting, 
the population-level description of this estimator is as follows. Let 7i x be 
a Hilbert space satisfying (A) and (B) (in this case an RKHS, but this 
assumption is unnecessary). Let T:7i x — > L 2 (Px) be the centering trans- 
formation T(f) = f — E(f). Let Ji,...,Jh be a partition of Qy, and let 
Hi,...,Hh S ranT be the Riesz representations of the linear functionals 

Tj-.TEnT^R, g^ E(g{X)\Y € J t ), i = l,...,h. 

In our language, KSIR uses (the sample version of) the subspace span(£^./ii, 
. . . , T, xx [if,) to estimate &y\x- The next theorem shows that any such rep- 
resentation must be a member of <£y|X) an d thus of &y\x (since £y\x Q 
&y\x) — which implies that KSIR is unbiased. 

Theorem 6. // (A) and (B) hold, then fij <G £y\x- 

Proof. By condition (A), ranT = L 2 {P X )- If / € L 2 (P X ) L 2 (P Y ) C 
ranT, then, by Lemma 1, E(f\Y) = 0. Hence (f,ln) L2 (p x ) = E[f(X)\Y € 
Ji} = 0. □ 

Yeh, Huang and Lee (2009) give another unbiasedness proof for KSIR, but 
they assume that the spanning functions of Hx , say fi, ■ ■ ■ , f m , satisfy the 
linear conditional mean assumption. That is, for any / £ Hx, E(f\fi, . . . , f m ) 

has the form cq + c\f\ -\ \- c m f m for some Co, . . . , c m 6 R. This condition is 

an analogue of the linear conditional mean assumption for linear SDR; see, 
for example, Li (1991) and Cook and Li (2002). Interestingly, our result no 
longer relies on this assumption. The reason Yeh, Huang and Lee need the 
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assumption in the first place is that they define the central class [Definition 1 
of Yeh, Huang and Lee (2009)] as the linear subspace spanned by hi, . . . , hd 
in span(/i, . . . , f m ) such that 

(11) YALX\hi(X),...,h d (X), 

whereas we define the central class as the class of all measurable functions 
of hi, ■ ■ ■ ,hd- Indeed, in the nonlinear setting there is no reason to restrict 
to this linear span formulation, since the conditional independence (11) only 
relies on the <7-field generated by hi,..., h d . 

6. Beyond completeness: Generalized SAVE. We now turn to the more 
general problem of estimating the central class when it is not complete, in 
which case the regression class may be a proper subset of the central class. 
We will generalize SAVE [Cook and Weisberg (1991)] to the nonlinear case 
and show that it can recover functions beyond the regression class. 

The setting here is different from that for GSIR in two respects. First, 
since we now deal with the location-invariant quantity f(X) — E[f(X)\Y], 
we no longer need to define the conditional mean operator through the 
centered L2-spaces Z^OFV) and I^OPx")- Second, we now define relevant 
operators through L2-spaces instead of RKHSs, which is more convenient 
in this context. Let L' 2 (Px) and L 2 (Py) denote the noncentered L2-spaces. 
Define the noncentered conditional mean operator E X , Y ■ L' 2 (Px) ~^ L' 2 (Py) 
through 

(12) (g, E xlY f) L , 2(Px) = E(g(Y)f(X)), f G L' 2 (P x ),g G L' 2 (Py). 

By the same argument of Proposition 3, E' x , y f = E(f(X)\Y). To generalize 
SAVE, we introduce a new type of conditional variance operator. 

Definition 8. For each y G Fly, the bilinear form 

L 2 (Px) x L 2 (P X ) ->M, (f,g) ^ (E' x]Y {fg) - E' x]Y fE x]Y g){y) 

uniquely defines an operator Vx\y{u) £ B(L2(Px)) y i a the Riesz represen- 
tation. We call the random operator 

Vx\y-^y^B{L 2 {Px)), y^V X \ Y (y) 

the heteroscedastic conditional variance operator given Y . 

The operator Vx\y is different from the conditional variance operator 
Tix\y introduced by Fukumizu, Bach and Jordan (2004, 2009). In a sense, 
Ejsfiy is a generalization of .E[var(X|Y)] rather than var(AT|Y), because 
(f,Xx\yf)H x = £[var(/(X)|Y)]. Note that E[vax(f(X)\Y)] becomes 
var(/(A)|y) only when the latter is nonrandom. So £x|Y might be called a 
homoscedastic conditional variance operator. In contrast, {f ,Vx\y f) L^iPx) 
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gives directly the conditional variance var[/(X)|y], hence the term het- 
eroscedastic conditional variance operator. Here, we should also stress that 
E' X , Y is defined between noncentered L' 2 {Px) and L' 2 (Py), whereas Vx\y{y) 
is defined between centered L,2{Px) and Z^OPx")- 

We now define the expectation of a generic random operator A : Vty —> 
B(Li2(Px))- For each / € L2(Px) and x 6 Qx, the mapping y i-> (A(y)f)(x) 
defines a random variable. Its expectation defines a function x >->• 
J Q ,{A(y) f)(x)Py (dy) , which is a member of L2(Px)- Denoting this mem- 
ber as /, we define the nonrandom operator L2(Px) —^ -^2 (-Px ) , / i— >• / as 
the expectation E(A). We now consider the operator 

(13) S = E(V-Vx\y) 2 :L2(Px)^L 2 (Px), 

where V:L2(Px) ~^ ^(Px) is the (unconditional) covariance operator de- 
fined by 

(f,Vg) L2{Px) =cov(f(X),g(X)). 

This operator is similar to T<xx in Section 5 except that it is not defined 
through RKHS. The operator S is an extension of the SAVE matrix [Cook 
and Weisberg (1991)] 

(14) T,~ 1 E[var(X)-vav(X\Y)] 2 T l - 1 . 

Let /3 be a basis matrix of the central subspace Sy\x °f linear SDR. Cook 
and Weisberg show that if E(X\(3 T X) is linear in /3 T X and var(X|/3 T X) is 
nonrandom, then the column space of (14) is contained in Sy\x- The next 
theorem generalizes this result, but without requiring an analogue of the 
linear conditional mean assumption. 

Theorem 7. Suppose that conditions (A) and (B) are satisfied, and 
\ax[f (X)\Qy\x\ is nonrandom for any f € &y\x- Then YEaS C & Y \X- 

PROOF. Let / _L &y\x- We claim that for any y € Uy, 

(15) (f,[V-V xlY (y)]f) L2{Px) =0. 
Because Y JL X\G Y \ X , we have 

y ai (f{X)\Y)=y ax (E(J(X)\g Y \ x )\Y) + E(y ax (f(X)\g Y \ x )\Y). 

Because, by Lemma 1, E(f(X)\Qyi x ) is constant, the first term is 0. Because 
var(f(X)\Q Y \x) is nonrandom, the second term is var(/ (X)\Q Y \ X ) ■ Hence 

var(/(A)|y) = var(/(A)|ey|x)- 
Similarly, 
var(/(X)) = var(£(/(A)|Sy|x)) + £(var(/(X)|£y| X )) = var(/(X)|£y| X ). 
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Therefore vai(f(X)\Y) =var(/(X)), which implies (15). Since V — V x \ Y (y) 
is self-adjoint, (15) implies / € ker V x \ Y (y). Hence 

(f,[V-V xl y(y)] 2 f) L2(Px) = 0. 
Now integrate both sides of this equation to obtain 

{f,(V-V xlY (y)) 2 f) L2{Px) P Y (dy) 



(f, I (V-V xlY (y)) 2 fP Y (dy)) 
\ Jn Y I \ 



LY I L 2 (P X ) 

= {f,(E(V-V xlY ) 2 )f) L2{Px) =0. 
Hence / 6 ker E(V — V X \ Y ) 2 , as desired. D 

Similar to the case of GSIR, we do not need to employ the rescaling by 
X -1 in (14) when generalizing SAVE, because the /^(-PxO-hmer product 
absorbs any marginal variance. We call the estimator derived from TaHS 
(see Section 7.3) generalized SAVE or GSAVE. The next theorem shows 
that GSAVE can recover functions outside £ Y \ X - 

Theorem 8. If conditions (A) and (B) are satisfied, then £ Y \ X C fans'. 

Proof. Since S is self-adjoint, it suffices to show that ker S C £y< x - For 
any / G ker S, 

(f,(V-V xlY ) 2 (y)f)P Y (dy) = 0. 
n Y 

Hence (/, (V - V x \ Y (y)) 2 f) L2(Px) = a.s. P Y , which implies (V-V x \ Y (y))f = 
a.s. P Y . Then 

/ (f,(V-V x]Y (y))f) L2{Px) P Y (dy)=0. 

By Definition 8, the left-hand side is var[/(X)] - E[vax(f(X)\Y)] = 

vai[E{f(X)\Y)]. Hence var[E(f(X)\Y)] = 0, which implies E[f(X)\Y] = 
E[f(X)] = 0. By Lemma 1, we have / G L2(P X ) Q L2(P Y ) = £ Y \ X , as desired. 
D 

Combining Theorems 7 and 8 we see that 

(16) C Ylx CTEnSCe Y{x , 

which is analogous to the relation (3) in the classical setting. Thus we can 
expect GSAVE to discover functions outside the class £ Y \ X , just as we can 
expect SAVE to discover vectors outside the space spanned by SIR. 
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7. Algorithms. We now develop algorithms for the sample versions of 
GSIR and GSAVE, together with a cross-validation scheme to select param- 
eters in the GSIR and GSAVE algorithms. These sample versions involve 
representing the operators in Theorems 5 and 7 as matrices. To formulate 
the algorithms we need to introduce coordinate representations of functions 
and operators, which we adopt with modifications from Horn and Johnson 
[(1985), page 31]; see also Li, Chun and Zhao (2012). 

Throughout this section, A* represents the Moore-Penrose inverse of a 
matrix A, A'" represents (A') a , I n denotes the n x n identity matrix, l n 
denotes the vector in R n whose entries are all 1 and Q = I n — l n l^/n. Let 
kx '■ Qx x ^x -^Kbea positive definite function. Also, let Kx be the n x 
n the Gram matrix {nx(Xi,Xj):i,j = l,...,n}, Gx its centered versions 
QKxQ and Lx the Gram matrices with intercept; that is, Lx = (l n ,Kx) T ■ 
Finally, define Ky , Ky , Gy , Ly in the same manner for Y. 

7.1. Coordinate representation. Let 7i be a finite-dimensional Hilbert 
space with spanning system B = {b\, . . . , b n }. For an / £ H, let [f] B € W 1 
denote the coordinates of / relative to B; that is, / = J27=i(if]i3)ih- Let 
b : Qx —> R n denote the ]R n -valued function (pi, . . . , b n ) . Then we can write 
/ = [/]gfr. Let A:H —> T-L 1 ', where %' is another finite-dimensional Hilbert 
spaces with spanning system C = {c\, . . . , c m } and let c = (c\, . . . , c m ,) T . Then, 

for fen, 

Af = A(b T [f] B ) = (Ah, ..., Ab n )[f] B = (JiAhjc, . . . ,c T [Ab n } c )[f} B . 

Thus, if we let C [A] B = ([Ah] c , . . . , [Ab n ] c ), then Af = c T ( c [A] B )[f] B . In 
other words, 

[Af] c = (c[A] B )[f] B . 

Furthermore, if A\ : %' — > "H" is another linear operator, where H" is a third 
finite-dimensional Hilbert space with spanning system T>, then, by a similar 
argument, 

v[AiA} B = ( v [Ai]c)(c[A] B ). 

Since the spanning systems in the domain and range of an operator are self- 
evident in the following discussion, we will write c[^]s an d [f] B simply as 
[A] and [/]. 

Suppose A G B(H) is self-adjoint. It can be shown that, for any a > 0, 
[A a ] = [A] a . Depending on the choice of the spanning system of 71, it is 
possible that A is invertible and yet [A] is singular, but it is generally true 
that A~ a = [A]^ a . Throughout this section the square brackets [•] will be 
used exclusively for denoting coordinate representations. 



NONLINEAR SUFFICIENT DIMENSION REDUCTION 19 

7.2. Algorithm for GSIR. At the sample level, Px is replaced by the 
empirical measure P n ,x', T~(-x is the RKHS spanned by Bx = {^x(-,Xi), • ■ • , 
Kx(-,X n )} with inner product (f,g)u x = If] 1 Kx[g], where [•] is coordi- 
nate with respect to Bx- The space L2(P n ,x) is spanned by Kx(-,Xi) — 
E n K X (X,Xi), i = l,...,n, with inner product (f,g) L ^ PnX - ) =cov n [f(X), 

g(X)] = n~ 1 [f]K xQK x[g\- The operator Mxx is defined through the rela- 
tion (f,M X xg)Hx =cov n {f{X),g(X)); that is, 

[f] T K x [M XX ] [g] = n" 1 [f] T K x QKx [g] . 

Since [/] and [g] are arbitrary members of M n , the above implies [Mxx] = 
n~ l QKx- Then any / G ran Mxx = Gx can be written as Mxx 9 for some 
g G Hx , which implies [/] = QKx [g] = Q [f] ■ Consequently, for any /, g G Qx , 
(f,g)n x = [f} T Gx[g}- ' 

Let us now find the matrix representations of Sxx> ^yy and Xyx- I n 
the following, hx represents the function n-> (kx(x, X\), . . . ,Kx(x,X n )) T . 
For any / G Gx i we have 

Sxx/ = Mxxf = h T x [Mxx][f} = n- x h\QK x Q{f\ = n^hlGxlf}. 

Hence [Sxx/] = Pxx] [/] = n~ 1 Gx [/]• Since this is true for all [/] G span(Q), 
we have [Sxx] = n~ l Gx- By the same argument we can show that 

[Exx] = n^Gx, [Eyy] = [£yy] = n^Gy, 

(17) 

[Syx] = [Syx] = n^Gx, [E x \y] = G Y G X G^ /2 G^ 2 . 
Theorem 5 suggests that we use r7m(£^., y E'x|y) to estimate ©y|x- Since 
E x , y Ex\Y is an operator on L2(P n ,x) to L2(P n ,x)-> the vectors in 
Iaii(E x , Y E x \y) can be found by 

maximizing (/, E xlY E x \ Y f) L2( P^ x) = \\ E x\Yf\\l 2{P ^ Y) 
subject to 

(/,/)l 2 (p„, x ) = 1- 
The coordinate representation of this problem is 

maximizing [f] T [E x \yV G 2 Y [E x \ Y ][f] subject to [f} T G x [f} = l. 

The optimal solution is [/] = G x 4>, where (ft are the leading eigenvectors of 
the matrix 

, N g x\ e x\y\ G y [Ex\y]G x 

(18) 

Gt /"iV2/"-ytV2/"i /"if /~i2 s-t\ /~t /~i\^- 1^ /-~A l^ r~i\ 
XX X *-J X^v^Y Y X^-J x X X' 

To enhance accuracy we replace the Moore-Penrose inverses G x and G Y 
by the ridge-regression-type regularized inverses (Gx + ex4) _1 and (Gy + 
eyl n )~ l . We summarize the algorithm as follows: 
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(1) Select the parameters j x , JY, ex, ey using the algorithm in Sec- 
tion 7.4. 

(2) Compute the matrix 

(Gx + ex/n)- 3/2 G^ /2 (Gy + ey/ n )- 1 G^(Gy + ey/ n )- 1 G^ /2 (G x + ex/n)- 3/2 

and its first d eigenvectors (f>\, . . . , (j)^ of this matrix. 

(3) Form the sufficient predictors at x (j)J(Gx + £xIn)~ 1 hx{x), i = l,...,d. 

GSIR estimation is similar to the kernel canonical correlation analysis 
(KCCA) developed by Akaho (2001), Bach and Jordan (2002) and Fuku- 
mizu, Bach and Gretton (2007). In our notation, KCCA maximizes 

(g^Yxf)L 2 (p Y ) = [9} T G Y Gx[f} 

subject to (g,T, Y Yg)L 2 (P Y ) = [9] T G Y [f] = 1 and (/, S X x/)l 2 (Px) = IdV x 
G Y [g] = 1- The optimal solution for [/] is [/] = (G x + tln)~ l 4>-, where <j) is 
one of the first d eigenvectors of 

(G x + eI n r 1 G x G Y (G Y + eI n r 2 G Y G x (G x +el n )- 1 . 

We will compare GSIR and KCCA in Section 8. 

7.3. Algorithm for GSAVE. We first derive the sample-level representa- 
tion of the operator V x \ Y (y). The sample version of the noncentered L 2 - 
classes L' 2 (P n ,x) and L' 2 (P n y) are spanned by 

(19) C X = {l,K X (;X 1 ),...,K X (;X n )}, C Y = {l,K Y (;Y 1 ),...,K Y (;Y n )}, 

respectively. Let [•] represent the coordinates relative to these spanning sys- 
terns. Then, for any / € L' 2 (P n>x ), (f(X 1 ), ..., f(X n )) T = L T x [f]. The opera- 
tor E' X ^ Y is defined through the relation (g, E'x\ Y f)v 2 {P^ Y ) = E n(g(Y)f(X)), 
which yields the representation 

(20) [E' xlY ] = (L Y L Y )\L Y L x ). 

Let £ Y denote the function y i-> (1, K Y (y, Y\), . . . ,K Y (y,Y n )) T , and let £ x 
denote the same function of x. For any /, g G L 2 (P n>x ), 

{E xlY (fg)-(E xlY f)(E' xlY g)}(y) 
(21) 

= £ Y (y)[E' xlY }[fg} - [f] T [E xlY ] T £ Y (y)£ Y (y)[E xlY ][g]. 

For any Xj, f(Xi)g(Xi) can be expressed as the ith entry of the vector 
L x [f] & L x [g], which is the same as L X (L X L X )^ L x (L x [f] QL x [g]), where 
is the Hadamard product. Thus we have the coordinate representation 

(22) [fg] = (LxL T x ^Lx(L T x lf]QL x lg}). 
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Substituting (20) and (22) into (21) we see that, for any f,g€ L' 2 {P n ,x)i 

(f,V xlY (y)g) LI{Pnx) = [f] T L x (d^gCy(y)-C Y (y)C^(y))L T x [g] 
(23) 

^[f] T L x A(y)L x [g], 

where C Y {y) = L Y {L Y L Y )U Y {y). 

Let S n : L 2 (P n , x ) ->■ L 2 (P n ,x) be the operator E n (V - V X \ Y {Y)f . By The- 
orem 7, GSAVE is the class of functions ran(S'). At the sample level, this 
corresponds to 

(24) maximizing (f,S n f) L ^ Pn ^ subject to (/,/)x 2 (P n , x ) = 1- 
By (23), for each y G fiy, and f,g G L2{P n ,x)-, we have 

(5, ^|y(y)/>L 2( P„ iX ) = [/] T L x QA(y)g^b]. 

Prom this we deduce that [Vxiy(y)] = (L x QL x /n)^ L x QA(y)QL x . By a 
similar derivation we find [V] = (L x QL x /n)^(L x QL x /n). Hence 

[V - Vx\ Y (y)\ = (L x QL x /n)^L x Q(Q/n - A(y))QL T x . 
It follows that 

(/, S n f) L2{Pnx) = E n {[f] T L x Q(Q/n - A(Y))Q(Q/n - A(Y))QL r x [f}}. 

To find ran(5 n ) we maximize the above subject to [f] T (L x QL x /n)[f] = 1. 
Again we use the regularized inverses instead of the Moore-Penrose inverses 
to enhance performance. The algorithm is summarized as follows: 

(1) Determine "fx,lY,£x,£Y using the algorithm is Section 7.4. 

(2) Compute C = L Y (L Y L Y + ey/ n+ i) _1 ' 2 Ly . Let C{ be the ith column 
of C. Compute Aj = diag(Cj) — C{Cj and then compute Tj = Q/n — Aj for 
i = 1,. . . ,n. 

(3) Compute 

n 

n- 1 Y,(LxQL T x + exIn+iy^LxQTiQTiQLxiLxQLl + e x I n+1 )' 1/2 

i=l 

and the first d eigenvectors of this matrix, say 0i, . . . , <f>d- 

(4) The sufficient predictors' values at any given x G £l x are the set of d 
numbers 

e T x (x)(L x QL T x + e x I n+l y 1/2 Q&, i = l,...,d. 

Here we should mention that, similar to SAVE for linear SDR, GSAVE 
works best for extracting predictors affecting the conditional variance of 
the response, but often not so well for extracting predictors affecting the 



22 K.-Y. LEE, B. LI AND F. CHIAROMONTE 

conditional mean. However, we expect that other second-order methods for 
linear SDR, such as directional regression [Li and Wang (2007)] and the 
minimum discrepancy approach [Cook and Ni (2005)], will be amenable 
to similar generalizations to nonlinear SDR. These will be left for future 
research. 

7.4. Cross-validation algorithm. We now develop a cross-validation scheme 
to determine the parameters 7x , 7y , ex , ey , which are used in the algorithms 
for both the GSIR and the GSAVE. We will only describe the algorithm for 
determining (7x,ex); that of (7y,ey) is completely analogous. 

In the following, for a matrix A, A—ij represents the submatrix of A 
with its ith row and jth column removed, and A— $ »• represents the jth 
column of A with the ith entry removed. Let Cy % = Cy\ {Ky(-, Y;)}, and 
define C x l similarly. Our cross-validation strategy is to predict f{Yi) f° r each 
/ £ Cy l , using the conditional mean operator developed from (C x l ,Cy 1 )- ^ ne 
regularized matrix representation of E' Y i x based on {C x % ,Cy l ) is 

i E Y\x]-(i+l),-(i+l) 

= {(Lx)-(i+i),-i(Lx)-(i+i)-i + exl n }~ (L x )-(i+i) -i{L Y )_ {i+l) _ v 

The feth member /& of Cy % is the function ej.{ly) -U-\.\){') where e^ is the 
vector in R n whose kth. entry is 1 and the remaining entries are 0. Therefore, 
the estimate of E(fk(Y)\X = x) based on on C x l is 

(^x)I (i+ i ) (a:)[-E'y|x/fc]-(i+i)= e I[- E; y|x]-(i+i)-(i+i)(^)-(i+i)( 2; )' 
and the prediction of (fi(Y i ),...,f n (Y i )) T is [E Y ^ x ][ {i+1) _^ +1) (£x)-(i+i)(X i ). 
However, because (£x)-U+i)(^i) is the vector (Lx)-(i+i) i? an d (fi(Y), ■ ■ ■ , 
fn(Yi)) T is the vector (Ly)_( i+1)ii , the difference between (/iQ"i), . . . , f n (Yi)) T 
and its prediction is 

(Ly)-(t+i),i - [E Y \x]-(i+i) ,~(i+i)( L x)~(i+i),i- 

To stress that this difference depends on ^x , ^x , 1y , we denote it by A j (ex , 
7x>7y)- Our cross-validation criterion is defined as CV(7x,ex,7y) = 
Y17=i l|Aj(7x 5 ex,7y)|| 2 - Since the role of 7y is only to determine the set 
of functions to be predicted, we exclude it from the optimization process 
(for the determination of €x,7x)- Moreover, as argued in Fukumizu, Bach 
and Jordan (2009), the parameters ex and 7x have similar smoothing ef- 
fects and only one of them needs to be optimized. For these reasons we fix 
7y and ex at 

(25) l/ 7ro =M Y,\^-Yj\\ exo = 0.01 



I: 


Y 


II: 


Y 


II: 


Y 
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and minimize CV(7x,exO)7yo) over a grid for jx- The grid consists of 20 
subintervals in [7j\:o/3,37xo]) equally spaced in the log scale, where 7x0 is 
calculated using the first formula in (25) with \Yi — Yj | replaced by ||Xj — Xj \\ . 
The rationale for this formula can be found in Li, Artemiou and Li (2011). 
The pair (tk, ey) is selected in the same way, except that eyo is set to 
0.001. This is because Y has dimension 1, so a weaker penalty is needed. 

8. Simulations and data analysis. In this section we present simulation 
comparisons among GSIR, GSAVE, KSIR and KCCA. For the reasons ex- 
plained in the previous section, we compare GSIR with KSIR and KCCA 
in settings where the sufficient predictor appears in the conditional mean, 
and we compare GSAVE with GSIR, KSIR and KCCA in settings where 
the sufficient predictor appears in the conditional variance. We also apply 
GSIR, KSIR and KCCA to two real data sets. 

8.1. Simulation comparisons. To make a comprehensive comparison of 
GSIR, KSIR and KCCA we consider three regression models, namely: 

(X? + Xlf 2 log(Xl+Xlf 2 + e; 
Xi/(l + e* 2 )+e, 

sm(Tr(X 1 + X 2 )/W)+e, 

e_U_X,e~iV(0,0.25),p = 10; 

as well as three distributional scenarios for the predictor vector X, namely: 
(A) independent Gaussian predictors, (B) independent non-Gaussian pre- 
dictors and (C) correlated Gaussian predictors. In symbols: 

A: X~N{0,I p ); 

X ~ (1/2)JV(-1„, J p ) + (l/2)N(l p ,I p ); 
A~A(0,0.6/ p + 0.41 p l T ). 

Note that the central cr-fields for the three models I, II and III are generated 
by X\ + X$, Xi/(1 + e X2 ) and sin(7r(A"i + A 2 )/10), respectively. 

We assess the quality of an estimated sufficient predictor by its closeness 
to the true sufficient predictor and its closeness to the response. Since we are 
only interested in monotone functions of the predictor, we use Spearman's 
correlation as the measure of closeness. For each combination of the models 
and scenarios, we generate n = 200 observations on (X,Y) as the training 
data, and compute the first predicting function using the each of three meth- 
ods. We then independently generate m = 200 observations on (X, Y) as the 
testing data, and evaluate the predicting functions at these points. Finally, 
we compute the mentioned Spearman's correlations from the testing data. 
This process is repeated N = 200 times. In Table 1 we list means and stan- 
dard deviations of the Spearman's correlations computed using the N = 200 
simulated samples. From the table we see that the performances of KCCA 
and GSIR are similar, and both are slightly better than KSIR. 
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Table 1 
Comparison of KSIR, KCCA and GSIR when sufficient predictors appear in the 

conditional means 



Models 


Spearman cor. with true predictor 


Spearman cor. with response 


X Y\X 


KSIR KCCA GSIR 


KSIR KCCA GSIR 



A I 0.78 (0.05) 0.81 (0.04) 0.80 (0.05) 

II 0.81 (0.05) 0.90 (0.03) 0.91 (0.03) 

III 0.76 (0.06) 0.89 (0.04) 0.91 (0.03) 

B I 0.88 (0.02) 0.88 (0.02) 0.87 (0.02) 

II 0.89 (0.03) 0.93 (0.02) 0.93 (0.02) 

III 0.90 (0.02) 0.97 (0.01) 0.97 (0.01) 

C I 0.79 (0.04) 0.82 (0.04) 0.81 (0.04) 

II 0.83 (0.05) 0.86 (0.06) 0.88 (0.04) 

III 0.83 (0.06) 0.96 (0.02) 0.96 (0.02) 



0.63 (0.06) 0.66 (0.05 

0.56 (0.06) 0.61 (0.05 

0.47 (0.07) 0.56 (0.05 

0.82 (0.03) 0.81 (0.03 

0.71 (0.04) 0.74 (0.04 

0.72 (0.04) 0.77 (0.03 

0.64 (0.05) 0.66 (0.05 

0.56 (0.06) 0.59 (0.06 

0.56 (0.06) 0.65 (0.04 



0.64 (0.05) 
0.62 (0.05) 
0.56 (0.05) 

0.80 (0.03) 
0.74 (0.04) 
0.77 (0.03) 

0.65 (0.05) 
0.60 (0.06) 
0.65 (0.04) 



Next, we compare GSAVE, KSIR, KCCA and GSIR when the predictors 
only affect the variance. We use the following models: 



'IV 

V 

VI 



Y 
Y 
Y 



(l/50)(Xf+Xi)e; 
(X l /(l + e^))e, 

and again the scenarios (A), (B) and (C) for the distribution of X. The 
specifications of n,m,N,p are the same as in the previous comparison. 

Because the sufficient predictors appear in the conditional variance 
var(y|X) only, it is less meaningful to measure the closeness between the 
estimated sufficient predictor and the response. So in Table 2 we only re- 
port the means and standard deviations of Spearman's correlations between 
the estimated and true sufficient predictors. We see that GSAVE performs 
substantially better than the other methods. The discrepancy can be ex- 
plained by the fact that KSIR, KCCA and GSIR depend completely on 
E[v&r(f(X)\Y)], whereas GSAVE extracts more information from 
var(/(A)|V). 

8.2. Data analysis. We first consider the faces data, available at http:// 
waldron.stanford.edu/isomap/datasets.html. This data set contains 698 im- 
ages of the same sculpture of a face photographed at different angles and 
with different lighting directions. The predictor comprises 64 x 64 image pix- 
els (thus p = 4096), and the response comprises horizontal rotation, vertical 
rotation and lighting direction measurements (thus q = 3). We use this data 
to demonstrate that the first three sufficient predictors estimated by KCCA 
and GSIR can effectively capture the 3-variate response. We use n = 558 of 
the images selected at random (roughly 80%) as training data, and the re- 
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Table 2 
Comparison of KSIR, KCCA, GSIR and GSAVE when sufficient predictors appear in 

conditional variances 



Models 



Spearman's correlation with true predictors 



X 


Y\X 


GSAVE 


KSIR 


KCCA 


GSIR 


A 


IV 


0.89 (0.08) 


0.10 (0.07) 


0.36 (0.22) 


0.41 (0.23) 




V 


0.73 (0.19) 


0.09 (0.07) 


0.17 (0.13) 


0.20 (0.14) 




VI 


0.84 (0.09) 


0.10 (0.08) 


0.25 (0.17) 


0.27 (0.17) 


B 


IV 


0.87 (0.08) 


0.10 (0.07) 


0.43 (0.25) 


0.53 (0.25) 




V 


0.88 (0.06) 


0.09 (0.07) 


0.11 (0.08) 


0.11 (0.08) 




VI 


0.76 (0.15) 


0.27 (0.11) 


0.61 (0.13) 


0.64 (0.13) 


C 


IV 


0.76 (0.20) 


0.11 (0.07) 


0.23 (0.16) 


0.26 (0.18) 




V 


0.82 (0.14) 


0.10 (0.07) 


0.11 (0.09) 


0.12 (0.09) 




VI 


0.73 (0.15) 


0.15 (0.10) 


0.41 (0.17) 


0.44 (0.17) 



maining m = 140 images as testing data. For each method, we estimate the 
first three predictor functions from the training data, and evaluate them on 
the testing data. The left panel of Figure 1 is the perspective plot of the first 
three KCCA predictors evaluated on the 140 testing images, and the right 
panel is the counterpart for GSIR. We did not include KSIR in this compar- 
ison because in its proposed form it cannot handle multivariate responses. 
The perspective plots indicate that nearby regions in the 3-D cubes have sim- 
ilar patterns of left-right rotation, up-down rotation and lighting direction, 
while distant regions have discernibly different patterns. This reflects the 
ability of the three sufficient predictors to capture the 3-variate responses. 



KCCA 




GSIR 




Fig. 1. First 3 sufficient predictors by KCCA (left panel) and GSIR (right panel), com- 
puted from 558 training images, and evaluated on 14-0 testing images — faces data. 
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KSIR 




Fig. 2. First 3 sufficient predictors by KSIR (upper-left panel), KCCA (upper-right 
panel) and GSIR (lower panel), computed on 1000 training images and evaluated on 1000 
testing images — handwritten digits data. 



Next, we apply KSIR, KCCA and GSIR to the handwritten digits data, 
available at http://www.cs.nyu.edu/~roweis/data.html. This data set 
contains 2000 images of p = 16 x 16 pixels showing handwritten digits from 
to 9 — the response is thus categorical with 10 levels. We use 1000 images as 
training data and 1000 as testing data. Again, for each method we estimate 
the first three sufficient predictors on the training data, and evaluate them 
on the testing data. Results are presented in the three perspective plots in 
Figure 2 — for visual clarity, these plots include only 100 randomly selected 
points from the 1000 in the testing data. The plots show that all three 
methods provide low-dimensional representations in which the digits are 
well separated. 

9. Concluding remarks. In this article we described a novel and very 
general theory of sufficient dimension reduction. This theory allowed us to 



NONLINEAR SUFFICIENT DIMENSION REDUCTION 27 

combine linear and nonlinear SDR into a coherent system, to link them with 
classical statistical sufficiency, and to subsume several existing nonlinear 
SDR methods into a unique framework. 

Our developments thus revealed important and previously unexplored 
properties of SDR methods. For example, unbiasedness of various nonlinear 
extensions of SIR proposed in recent literature was proved under the strin- 
gent linear conditional mean assumption. We were able to show that these 
methods are all unbiased under virtually no assumption, and that GSIR is 
exhaustive under the completeness assumption. We were also able to show 
that nonlinear extensions of SIR are in general not exhaustive when com- 
pleteness is not satisfied, and that in these cases GSAVE can recover a larger 
portion of the central class. These insights could not have been obtained 
without paralleling linear and nonlinear SDR as allowed by our new theory. 

In addition to achieving theoretical synthesis and important insights on 
SDR methods, we introduced a new heteroscedastic conditional variance 
operator — which is more general than the (homoscedastic) conditional vari- 
ance operator in Fukumizu, Bach and Jordan (2004, 2009). This operator 
was crucial to generalizing SAVE to the nonlinear GSAVE, and thus to ex- 
ploit dependence information in the conditional variance to improve upon 
the performance of the nonlinear extensions of SIR. We have no doubt that 
the heteroschedastic conditional variance operator can be used to generate 
nonlinear extensions of other second-order SRD methods such as contour 
regression [Li, Zha and Chiaromonte (2005)], directional regression [Li and 
Wang (2007)], SIR-II [Li (1991)] and other F2M methods [Cook and Forzani 
(2009)]. These extensions will be the topic of future work. 

More generally, it is our hope that the clarity and simplicity that classical 
notions lend to the formulation of dimension reduction, as well as the trans- 
parent parallels we were able to draw between linear and nonlinear SDR, 
will provide fertile grounds for much research to come. 

As we put forward a general theory that encompasses both linear and 
nonlinear SDR, it is also important to point out that linear SDR has its 
special values that cannot be replaced by nonlinear SDR via kernel mapping, 
one of which is its preservation of the original coordinates and as a result 
its strong interpretability. For example, when mapped to higher dimension 
spaces, kernel methods can sometimes interpret difference in variances in the 
original coordinates as location separation in the transformed coordinates, 
which can be undesirable depending on the goal and emphasis of particular 
applications. For further discussion and an example of this point, see Li, 
Artemiou and Li (2011). 
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nonlinear sufficient dimension reduction for functional data is inspired by 
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SUPPLEMENTARY MATERIAL 

Supplement to "A general theory for nonlinear sufficient dimension re- 
duction: Formulation and estimation" (DOI: 10. 1214/12- AOS1071SUPP; 
.pdf). This is supplementary appendix that contains some techincal proofs 
of the results in the paper. 
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